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Abstract. In this work, we propose D®-Tree, a dynamic distributed 
deterministic structure for data management in decentralized networks. 

We present in brief the theoretical algorithmic analysis, in which our 
proposed structure is based on, and we describe thoroughly the key as¬ 
pects of the implementation. Gonducting experiments, we verify that 
the implemented structure outperforms other well-known hierarchical 
tree-based structures, since it provides better complexities regarding 
load-balancing operations. More specifically, the structure achieves an 
0(log A) amortized bound {N is the number of nodes present in the net¬ 
work), using an efficient deterministic load-balancing mechanism, which 
is general enough to be applied to other hierarchical tree-based struc¬ 
tures. Moreover, we investigate the structure’s fault tolerance, which 
hasn’t been sufficiently tackled in previous work, both theoretically and 
through rigorous experimentation. We prove that D®-Tree is highly fault 
tolerant, since, even for massive node failures, it achieves a significant 
success rate in element queries. Afterwards we go one step further, in or¬ 
der to achieve sub-logarithmic complexity and propose the ART^ struc¬ 
ture (Autonomous Range Tree), exploiting the excellent performance of 
D®-Tree. ART"*" achieves an 0(logj log A) communication cost for query 
and update operations (6 is a double-exponentially power of 2 and A 
is the total number of peers). Moreover, ART"^ is a fully dynamic and 
fault-tolerant structure, which supports the join/leave node operations in 
0(log log A) expected w.h.p number of hops and performs load-balancing 
in O(loglogA) amortized cost. 

Keywords: decentralized system, distributed data structure, p2p, im¬ 
plementation, load-balancing, fault tolerance 


1 Introduction 

Decentralized systems and in particular Peer-to-Peer (P2P) networks, have gen¬ 
erated a lot of interest worldwide among the computer networking community. 
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Although they have actually existed for many years, they have become very pop¬ 
ular nowadays and are promoted as the future of Internet networking. They are 
widely used for sharing resources and store very large data sets, using systems 
of small computers instead of large costly servers. 

According to [5], a P2P network ”is a type of decentralized and distributed 
network architecture, in which, individual nodes in the network (called ’’peers”) 
act as both suppliers and consumers of resources, in contrast to centralized client- 
server model where client nodes request access to resources provided by central 
servers. In a peer-to-peer network, tasks (such as searching for files or streaming 
audio/video) are shared amongst multiple interconnected peers, who each make 
a portion of their resources (such as processing power, disk storage or network 
bandwidth) directly available to other network participants, without the need 
for centralized coordination by servers”. 

In P2P networks, data are stored at the nodes (or peers) and the most cru¬ 
cial operations are data search and data updates. A P2P network is represented 
by a graph, a logical overlay network, where its nodes correspond to the net¬ 
work nodes, while its edges may not correspond to existing communication links, 
but to communication paths. We assume constant size messages between nodes 
through links and asynchronous communication. It is assumed that the network 
provides an upper bound on the time needed for a node to send a message and 
receive an acknowledgement. In this way, the network provides a mechanism to 
identify communication problems, which may refer to communication links or 
nodes that are down. The complexity (cost) of an operation is measured in terms 
of the number of messages issued during its execution (internal computations at 
nodes are considered insignificant). 

With respect to its structure, the overlay supports the operations Join (of a 
new node v, v communicates with an existing node u in order to be inserted into 
the overlay), and Departure (of an existing node u; u leaves the overlay announc¬ 
ing its intent to other nodes of the overlay). Moreover, the overlay implements 
an indexing scheme for the stored data, supporting the operations Insert a new 
element, Delete an existing element. Search for an element and Range Query for 
elements in a specific range. 

Range query processing in decentralized network environments is a notori¬ 
ously difficult problem to solve efficiently and scalably. In cloud infrastructures, 
a most significant and apparent requirement is the monitoring of thousands of 
computer nodes, which often requires support for range queries: consider range 
queries issued in order to identify under-utilized nodes so as to assign them more 
tasks, or to identify overloaded nodes so as to avoid bottlenecks in the cloud. 
For example, we wish to execute range queries such as: 

SELECT NodelD 
FROM CloudNodes 

WHERE Low < utilization < High 
or both single and range queries such as: 


SELECT NodelD 
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FROM CloudNodes 

WHERE Low < utilization < High AND os = UNIX 

Moreover, in cloud infrastructures that support social network services like 
Facebook, user profiles are stored distributed in several nodes and we wish to 
retrieve user activity information, executing range queries such as: 

SELECT COUNT(userlD) 

FROM CloudNodes 

WHERE 3/1/2015 < time < 3/31/2015 AND userlD = 123456 
AND NodelD IN Facebook 

An acceptable solution for processing range queries in such large-scale decen¬ 
tralized environments must scale in terms of the number of nodes as well as in 
terms of the number of data items stored. For very large volume data (trillions 
of data items at millions of nodes) the classic logarithmic complexity offered by 
solutions in literature, is still too expensive for single and range queries. Fur¬ 
ther, all available solutions incur large overheads with respect to other critical 
operations, such as join/leave of nodes, and insertion/deletion of items. Our aim 
in this work is to provide a solution that is comprehensive and outperforms 
related work with respect to all major operations, such as search, join/leave, 
insert/delete and load-balancing and to the required routing state that must be 
maintained in order to support these operations. In particular, our ultimate goal 
is to achieve a sub-logarithmic complexity for all the above operations. 

In this work, we focus on hierarchical tree-based overlay networks that sup¬ 
port directly range and more complex queries. We introduce a much promis¬ 
ing deterministic decentralized structure for distributed data, called D^-Tree. 
Through experiments, we verify that the implemented structure outperforms 
other well-known tree-based structures, since it provides better complexities re¬ 
garding load-balancing operations. More specifically, the structure achieves an 
amortized bound of O(logfV) {N is the number of nodes present in the net¬ 
work), using an efficient deterministic weight-based load-balancing mechanism, 
which is general enough to be applied to other hierarchical tree-based struc¬ 
tures. Moreover, our structure achieves an 0{\ogN) search performance. Last 
but not least, we investigate the structure’s fault tolerance, which has not been 
sufficiently tackled in its predecessor[T], both theoretically and through rigorous 
experimentation, proving that D^-Tree is highly fault-tolerant. 

Afterwards we go one step further, in order to achieve sub-logarithmic com¬ 
plexity and propose the ART+ structure, exploiting the excellent performance 
of D^-Tree. The outer level of ART+ is an ARTQ structure [TT], built by grouping 
clusters of peers, whose communication cost of query and update operations is 
0(log^ log IV) hops, where the base 6 is a double-exponentially power of two and 
N is the total number of nodes. Moreover, ART is a fully dynamic and fault- 
tolerant structure, which supports the join/leave node operations in 0(log log N) 
expected w.h.p number of hops and performs load-balancing in 0(\oglogN) 
amortized cost. Each cluster-peer of ART+ is organized as a D^-Tree. 

Autonomous Range Tree 


4 
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The rest of this paper is organized as follows: Previous work is presented in 
Section The weight-based mechanism used in D^-Tree is briefly described in 
Sectionj^ Section|^presents our proposed structure by describing the theoretical 
background and discussing the enhancements and implementation aspects. In 
Section we present the ART+ structure which exploits the performance of D^- 
Tree. Section [ghosts the performance evaluation for both structures. The paper 
concludes in Section [3 

2 Related Work 

Extensive work has been done to support search and update techniques in dis¬ 
tributed data, which are crucial operations in P2P networks. In this section 
we discuss the existing solutions referring to their contribution and weaknesses, 
emphasizing in hierarchical tree-based structures. 

Existing structured P2P systems can be classified into two broad categories: 
Distributed Hash Table (DHT)-based systems and tree-based systems. Exam¬ 
ples of the former, which constitute the majority, include Chord, CAN, Pastry, 
Symphony, Tapestry [7] and P-Ring [5]. In general, DHT-based systems sup¬ 
port exact match queries well and use (successfully) probabilistic methods to 
distribute the workload among nodes equally. Since hashing destroys the or¬ 
dering on keys, DHT-based systems typically do not possess the functionality 
to support straightforwardly range queries, or more complex queries based on 
data ordering (e.g., nearest-neighbour and string prefix queries). Some efforts 
towards addressing range queries have been made in |8l8j . getting however ap¬ 
proximate answers and also making exact searching highly inefficient. The most 
recent effort towards range queries is the P-Ring [5]. P-Ring is fully distributed 
and fault-tolerant, provides load-balancing and supports both exact match and 
range queries, achieving 0(log^ A^-|-fc) range search performance in average case 
(N is the number of peers, d is the order of the ring and k is the answer size) 
and 0{d ■ log^^ N + k) in worst case. 

Tree-based systems are based on hierarchical structures. They support range 
queries more naturally and efficiently as well as a wider range of operations, since 
they maintain the ordering of data. On the other hand, they lack the simplicity 
of DHT-based systems, and they do not always guarantee data locality and 
load balancing in the whole system. Important examples of such systems include 
Family Trees [7], BATON [5], BATON* [4] and Skip List-based schemes like 
Skip Graphs (SG), NoN SG, SkipNet (SN), Deterministic SN, Bucket SG, Skip 
Webs, Rainbow Skip Graphs (RSG) and Strong RSG [7] that use randomized 
techniques to create and maintain the hierarchical structure. 

Emphasis should be given to the fact that w.r.t. load-balancing, the solutions 
provided in the literature are either heuristics, or provide expected bounds under 
certain assumptions, or amortized bounds but at the expense of increasing the 
memory size per node. In particular, in BATON [^, a decentralized overlay is 
provided with load-balancing based on data migration. However, their 0{\ogN) 
amortized bound {N is the number of nodes in the network) is valid only subject 
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to a probabilistic assumption about the number of nodes taking part in the data 
migration process, and thus it is in fact an amortized expected bound. Moreover, 
its successor BATON*, exploits the advantages of higher fanout (number of 
children per node), to achieve reduced search cost of 0(log^ N), where m is the 
fanout. However, the higher fanout leads to larger update and load-balancing cost 
of Oim ■ log^ N). On the other hand, in DHT systems, P-Ring [2] maintains a 
load imbalance factor of at most 2 -|- e in a stable system, for any given constant 
e > 0, and has a stabilization process for fixing inconsistencies caused by peer 
failures and updates, achieving an 0{d ■ logj^N) performance. 

As far as network’s fault tolerance is concerned, P-Ring [5] is considered 
highly fault-tolerant, using the Chord’s Fault Tolerant Algorithms [H]. BATON 
maintains vertical and horizontal routing information not only for efficient 
search, but to offer a large number of alternative paths between two nodes. In its 
successor BATON* [3], fault tolerance is greatly improved due to higher fanout. 
When fanout = 2, approximately 25% of nodes must fail before the structure 
becomes partitioned, while increasing the fanout up to 10 leads to increasing 
fault tolerance (60% of failed nodes partition the structure). A comparison of 
the aforementioned structures and our proposed structure is given in Table 


Table 1: Comparison of P-Ring, BATON*, D^-Tree, ART and ART+. 


Structures 

Search key 

Insert/Delete key 
(load-balancing) 

Max. size 
routing table 

of 

Join/Depart peer 
(updating routing 
tables) 

P-Ring 

0(log,N) 

0(d.log,A) 

O(logA) 


0(d.log,A) 

BATON 

O(logA) 

O(logA) 

O(logA) 


O(logA) 

BATON* 

0(log™iV) 

0(m • log^ A) 

0{m ■ log^ A) 


0(m • log„ A) 

D^-Tree 

O(logA) 

O(logA) 

O(logA) 


O(logA) 

ART 

d{\ogl log A) 

0(m • log^ log A) 

0(AC4/log'’A) 

0(m • log„ log A) 

ART+ 

d{\ogl log A) 

O(loglogA) 

0(AC4/log'’A) 

O(loglogA) 


Legend: N-. number of peers, d: order of ring, m: fanout, c > 0, b: double-exponentially 
power of 2, O: expected bound, O: amortized bound, O: expected amortized bound. 


3 A Weight-Based Load-Balancer 

In this section, we describe a weight-based load-balancing mechanism which is 
an efficient solution for element updates in hierarchical tree-based structures. All 
definitions used in this section and throughout this paper, have been compiled 
in Table The main idea of this mechanism is the almost equal distribution 
of elements among nodes by making use of weights^ a metric which shows how 
uneven is the load among nodes. When the load is uneven, then a data migration 
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process is initiated to equally distribute the elements. The method has two steps; 

(a) first, it provides efficient and local update of weight information in a tree 
when elements are added or removed at the leaves, using virtual weights, and 

(b) it provides an efficient load-balancing mechanism which is activated when 
necessary. 


Table 2: Symbols and Definitions 
Symbol Definition 


w{v) : weight of v 
e{v) 

|n| : size of v 
d(v) : density of v 

c(p, q) : criticality 

nCv ■ node criticality 


number of elements stored in the subtree of a (including v) 

number of elements residing in a node v 

number of nodes of the subtree of v (including v) 

d{v) = represents the mean number of elements per 

node in the subtree of v 

c{p, q) = represents the difference in densities between 
brothers p and q 

nCv = represents the difference in size between a node v 
and its left child w 


More specifically, when an element is added/removed to/from a leaf u in a 
tree structure T, the weights on the path from u to the root must be updated. 
This is a costly operation, when it is performed on every element update. Instead 
of updating weights every time, a new metric is defined, virtual weight W(v). 
Assume that node v lies at height h and its children are vi,V 2 , ■ ■ ■ ,Vs at height 
h—1. W{v) of V is defined as the weight stored in node v. In particular, for node 
V the following invariants are maintained: 

Invariant 1. W(v) > e{v) -h (1 — en) {J2i=i 

Invariant 2. W(v) < e{v) -h (1 -I- e'^) {J2i=i W{vi)) 

where eh = ^'h = ^ ■ The constants eh and are chosen such that for all 
nodes the virtual weight will be within a constant factor c > 1 of the real weight, 
i.e., i • w{v) <W{v) < c - w{v). 

When an update takes place at leaf u, the mechanism traverses the path 
from u to the root updating the weights by ±1, until node z is found for which 
Invariants [T] and [2] hold. Let v be its child for which either Invariant [T] or |2] does 
not hold on this path. All weights on the path from u to v are recomputed; for 
each node z on this path, its weight information is updated by taking the sum 
of the weights of its children plus the number of elements that z carries. 

Another Invariant which is maintained and is crucial for the load-balancing 
mechanism, involves the criticality c(p, q) = of two brother nodes p and q 
(representing their difference in densities). The invariant guarantees that there 
will not be large differences between densities: 
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Invariant 3. For two brothers p and q, it holds that ^ < c{p, q) < c, 1 < c < 2 

For example, choosing c = 2 we get that the density of any node can be at 
most twice or half of that of its brother. 

When an update takes place at leaf u, weights are updated as described above. 
Then, the load-balancing mechanism redistributes the elements among leaves 
when the load between leaves is not distributed equally enough. In particular, 
starting from u, the highest ancestor w is located that is unbalanced w.r.t. his 
brother z, meaning that Invariant is violated. Finally, the elements in the 
subtree of their father v are redistributed uniformly so that the density of the 
brothers becomes equal; this procedure is henceforth called redistribution of node 

V. 

The weight-based mechanism is slightly modified to be applied also in node 
updates. Virtual size S{v) is dehned and the same mechanism is applied using 
invariants similar to Invariants and Moreover, a new invariant is defined, 
involving node criticality. 

Invariant 4- The node criticality of all nodes is in the range [j, |]. 

Invariant implies that the number of nodes in the left subtree of a node v 
is at least half and at most twice the corresponding number of its right subtree. 

The weight-based mechanism described above (and its slight modihcation) 
achieves an 0(1) amortized cost for weight update and O(logiV) amortized cost 
for redistribution. It is general enough to be applied to other hierarchical tree- 
based structures and was proposed and thoroughly described in [T]. 

4 The D^-Tree 

In this section, we present our proposed structure, D^-Tree, which introduces 
many enhancements over the solutions in literature and its predecessor [1]. In 
general, a D^-Tree structure with N nodes and n data elements residing on them 
achieves: (i) 0([ogN) space per node; (ii) deterministic 0(log A^) searching cost; 
(iii) deterministic amortized 0{\ogN) update cost both for element updates and 
for node joins and departures; (iv) deterministic amortized 0{logN) bound for 
load-balancing. 

4.1 The Structure 

Let N be the number of nodes present in the network and let n denote the size of 
data {N ^ n). The structure consists of two levels. The upper level is a Perfect 
Binary Tree (PBT) of height O(logiV). The leaves of this tree are representatives 
of the buckets that constitute the lower level of the D^-Tree. Each bucket is a 
set of O(logfV) nodes which are structured as a doubly linked list. The number 
of nodes of the PBT is not connected by any means to the number of elements 
stored in the structure. The structure supports the operations of node join and 
node departure, while at the same time it tackles failures of nodes whenever 
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these are discovered. Each node v of the D^-Tree maintains an additional set of 
links to other nodes apart from the standard links which form the tree: 

1. Links to its father and its children. 

2. Links to its adjacent nodes based on an in-order traversal of the tree. 

3. Links to nodes at the same level as v. The links are distributed in exponential 
steps; the first link points to a node (if there is one) 2° positions to the left 
(right), the second 2^ positions to the left (right), and the i-th link 2*“^ 
positions to the left (right). These links constitute the routing table of v and 
require O(logiV) space per node. 

4. Links to leftmost and rightmost leaf of its subtree. These links accelerate 
the search process and contribute to the structure’s fault tolerance when a 
considerable number of nodes fail. 

5. For leaf nodes only, links to the buckets of the nodes in their routing tables. 
The first link points to a bucket 2° positions left (right), the second 2^ 
positions to the left (right) and the i-th link 2*“^ positions to the left (right). 
These links require 0{\ogN) space per node and keep the structure fault 
tolerant, since each bucket has multiple links to the PBT. 

The next lemma captures some important properties of the routing tables. 

Lemma 1. (i) If a node v contains a link to node u in its routing table, then 
the parent of v also contains a link to the parent of u, unless u and v have the 
same father, (ii) If a node v contains a link to node u in its routing table, then 
the left (right) sibling of v also contains a link to the left (right) sibling of u, 
unless there are no such nodes. (Hi) Every non-leaf node has two adjacent nodes 
in the in-order traversal, which are leaves. 

Regarding the index structure of the D^-Tree, the range of all values stored 
in it is partitioned into sub-ranges each one of which is assigned to a node 
of the overlay. An internal node v with range may have a left child 

u and a right child w with ranges and respectively such that 

Xu < < Xy < x( < Xyj < x'„,. Ranges are dynamic in the sense that they 

depend on the values maintained by the node. 

4.2 Implementation Aspects 

During the implementation process, many of the theoretical aspects of the pro¬ 
posed structure were put under the microscope. The transition from theory to 
practice is full of challenges, leading to issues that either can be solved or have 
to be simplified. Below we describe the key features of the D^-Tree simulator 
and the algorithmic steps behind the operations it supports. 

A key feature of our proposed structure, thanks to which the high perfor¬ 
mance is achieved, is the weight-based mechanism [T], used for node redistribu¬ 
tion after node updates and data load-balancing after element updates. The main 
idea is the almost equal distribution of elements among nodes, using weights, a 
metric which shows how uneven is the load among nodes. The mechanism, which 
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was described in Section lazily updates the weight information on nodes, so 
load-balancing is performed only when it is absolutely necessary. 

Another key feature is the enhanced search mechanism, in case of node fail¬ 
ures. Nodes perform a number of effective horizontal contacts to other nodes 
of the same level and take into account several alternative paths, in order the 
operation to be successful, even when a considerable number of nodes fails. Last 
but not least, the structure is highly fault tolerant, since is supports a procedure 
of node withdrawal, when a node is found unreachable, regardless of its posi- 
tion(internal node, leaf, bucket node). The success of the last two operations is 
due to the small number of additional links a node maintains, through which it 
can reconstruct the routing table of a fallen node. 

4.3 Node Joins and Departures 

Handling node updates. When a node z makes a join request (Alg. to u, 
V forwards the request to an adjacent leaf u. If u is a binary node, the request 
is forwarded to the left adjacent node, w.r.t. the in-order traversal, which is 
definitely leaf (unless u is a leaf itself). In case u is a bucket node, the request 
is forwarded to the bucket representative, which is leaf. Then, node z is added 
to the doubly linked list of the bucket represented by u. In node joins, we make 
the simplification that the new node is clear of elements. It could be entered 
anywhere in the bucket, as the first, the last or an intermediate node, but we 
prefer to place it after the most loaded node of the bucket. Thus, the load is 
shared and the new node stores half of the elements of the most loaded node. 
Finally, the left and right adjacents of the newly inserted node update their links 
to previous and next node and z creates new links to those nodes. 

Algorithm 1. Join (Node newNode) 

REQUIRE initialNode 

IF initialNode is BucketNode THEN 

initialNode.Representative.Join(newNode); 

ELSE IF initialNode is InternalBinaryNode THEN 
initialNode.LeftlnOrderAdjacent.Join(newNode); 

ELSE IF initialNode is Leaf THEN 

mostLoadedNode = initialNode.FindMostLoadedO; 
mostLoadedNode = AcceptAsAdjacent(newNode); 
SpIitDataCmostLoadedNode, newNode); 

END IF 

When a node v leaves the network, it is replaced by an existing node, so 
as to preserve the in-order adjacency. All navigation data are copied from the 
departing node v to the replacement node, along with the elements of v. If v is 
an internal binary node, then it is replaced by its right adjacent node, which is 
a leaf and which in turn is replaced by the first node z in its bucket. If u is a 
leaf, then it is directly replaced by z. If the departing node belongs to a bucket 
no replacement takes place, but simply all stored elements of v are copied to the 
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previous node of v (or to the next node if the previous does not exist). Then v 
is free to depart. 

After a node join or departure, the modified weight-based mechanism is ac¬ 
tivated and updates the sizes by ±1 on the path from leaf u to the root (Alg. 
[^, as long as the defined Invariants [^or|^ do not hold. When the first node w is 
accessed for which Invariants and hold, the nodes in the subtree of its child 
q in the path, have their sizes recomputed (Alg. [^. Afterwards, the mechanism 
traverses the path from leaf u to the root, in order to find the first node (if 
such a node exists) for which Invariant is violated (Alg. and performs a 
redistribution in its subtree. 

Algorithm 2. UpdateVirtualSize (BinaryNode leaf, NodeUpdate operation) 

currentNode = leaf; 

WHILE currentNode.Father != null DO 
IF operation == NodeJoin THEN 
currentNode.Father.Size++; 

ELSE IF operation == NodeDeparture THEN 
currentNode.Father.Size—; 

END IF 

/* if Invariant 1 OR Invariant 2 do not hold */ 

IF currentNode.Father.Size <= minSize OR 
currentNode.Father.Size >= maxSize THEN 
currentNode = currentNode.Father; 

ELSE 

RETURN currentNode; 

END IF 
END WHILE 
RETURN Root; 

Algorithm 3. ComputeSizelnSubtree () 

REQUIRE node 
IF node IS Leaf THEN 

node.Size = node.Bucket.Size; 

ELSE 

node.LeftChild.ComputeSizelnSubtree(); 

node.RightChild.ComputeSizelnSubtree(); 

node.Size = node.LeftChild.Size + node.RightChild.Size; 

node.minSize = (1 - $e_h$) * node.Size; 

node.maxSize = (1 + $e_h$) * node.Size; 

END IF 

Algorithm 4. CheckNodeCriticality(BinaryNode leaf) 

currentNode = leaf.Father; 

WHILE currentNode != null DO 

IF currentNode.NodeCriticality IN [minValue, maxValue] THEN 
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currentNode = currentNode.Father; 

ELSE 

RETURN currentNode; 

END IF 
END WHILE 

RETURN null; /* redistribution isn’t necessary */ 


Node Redistribution. The redistribution guarantees that if there are z nodes 
in total in the y buckets of the subtree of v, then after the redistribution each 
bucket maintains either [z/yj or [z/yj + 1 nodes. The redistribution cost is 
O(logN), which is verified through experiments. 

The redistribution in the subtree of v is carried out as follows (Alg. . We 
assume that the subtree of v at height h has K buckets. A traversal of all buckets 
is performed in order to determine the exact value of |u| (number of nodes in 


the buckets of the subtree of v). Then, the first k buckets, will contain + 1 
nodes after redistribution, where k = \v\ mod 2^. The remaining K — k buckets 
will contain nodes. The redistribution starts from the rightmost bucket b 


and it is performed in an in-order fashion so that elements in the nodes are not 
affected. 

We assume that b has q extra nodes which must be transferred to other 
buckets. Bucket b maintains a link dest to the next bucket b' on the left, in 
which q extra nodes should be put. The q extra nodes are removed from b and 
are added to b'. The crucial step in this procedure is that during the transfer 
of nodes, internal nodes of PBT are also updated, since the in-order traversal 
must remain untouched. More specifically, the representative z of 6 as well as 
the its left in-order adjacent w are replaced by nodes of bucket b and then, z and 
w along with the remaining g — 2 nodes of b are added to the tail of bucket b'. 
Afterwards, the horizontal and vertical links for the replaced binary nodes are 
updated, as well as for nodes that point to them. Finally, bucket b informs b' to 
take over and the same procedure applies again with b' as the source bucket. 

The case where q nodes must be transferred to bucket b from bucket b' is 
completely symmetric. In general, q nodes are removed from the tail of bucket 
b', two of them replace nodes w and z and the remaining nodes are added to the 
head of b. The intriguing part comes when b' contains less nodes than the q that 
b needs. In this case, b has to find the remaining nodes in the buckets on the left, 
so dest travels towards the leftmost bucket of the subtree, until q < 1^*1) 

where \bi\ is the size of the i — th bucket on the left. Then, nodes of bg move to 
5s-i, nodes from 6s_i are transfered to bs -2 and so on, until dest goes backwards 
to b' and q nodes are moved from b^ into b. 

The redistribution cost is 0{logN) and is verified through experiments pre¬ 
sented analytically in this work. 


Algorithm 5. RedistributeSubtree() 
REQUIRE BinaryNode node 
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node.ComputeSizelnSubtree 0; 
newBucketSize = node.Size / 2~h; 
leftMostLeaf = node.LeftMostLeaf; 
currentNode = node.RightMostLeaf; 

WHILE currentNode != leftMostLeaf DO 
currentBucket = currentNode.Bucket; 

destinationBucket = currentNode.LeftAdjacents[0].Bucket; 

IF currentBucket.Size > newBucketSize THEN 

nodesToMove = currentBucket.Size - newBucketSize; 

FOR i=0; i < nodesToMove; i++ DO 

Replace(currentNode, currentBucket.FirstNode) 

Replace(currentNode.LeftAdjacent, currentBucket.Representative); 
destinationBucket.InsertLast(currentNode.LeftAdjacent); 
UpdateAllLinks0; 

END FOR 

ELSE IF currentBucket.Size < newBucketSize THEN 

nodesToMove = newBucketSize - currentBucket.Size; 
tempDestNode = currentNode.LeftAdjacents[0]; 

WHILE tempDestNode != leftMostLeaf AND 
availableNodes < nodesToMove DO 

availableNodes += tempDestNode.Size; 
tempDestNode = tempDestNode.LeftAdjacents[0]; 

END WHILE 

tempNodesToMove = tempDestNode.Size - 
(totalNodesAvailable - nodesToMove); 

WHILE tempDestNode != currentNode DO 

/* ... similar to previous case ...*/ 

/* go back to the right, moving nodes */ 

END WHILE 
END IF 

currentNode = currentNode.LeftAdjacents[0]; 

END WHILE 


Extension - Contraction. Throughout joins and departures of nodes, the size 
of buckets can increase undesirably or can decrease so much that some buckets 
may become empty. Either situations violate the D^-Tree principle for bucket 
size |5|: 

oi log < |6| < 02 logiV (1) 

where 0 < oi < 02 - The structure guarantees that Q is always true, by em¬ 
ploying two operations on the PBT, extension and contraction (Figure[^. These 
operations are activated when a redistribution occurs at the root of the PBT 
and they add one level to the PBT when |6| >02 logfV or delete one level from 
the PBT when |6| < oi log 

The extension (Alg. is carried out as follows: the last level of the PBT is 
affected and a new level of leaves, as well as, a new set of buckets are created. 
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using nodes from the old buckets. In particular, each leaf u and its bucket B 
with nodes 6 i, 62 , 6 s are replaced by a 2 -level binary subtree T 2 preserving, 
at the same time, the in-order adjacency. Thus, the left leaf of T 2 is the old leaf 


the root of To is the i — th node of h, where i = 


the bucket Bi of u 


contains nodes 61 ,..., 6 i-i, the right leaf of T 2 is the {i -\- 1) — th node of b and 
its bucket B 2 contains the remaining nodes of old bucket B, 6 ^+ 2 ,..., 6 s- During 
the process, all in-order adjacency links and father-child links are updated. After 
the extension has been carried out, leaves and nodes in height 6 = 1 reconstruct 
their routing tables and all binary nodes update the link to the rightmost leaf 
of their subtre^ 


Algorithm 6 . PerformExtension() 

REQUIRE BinaryNode Root 
currentLeaf = Root.LeftMostLeaf; 

WHILE currentLeaf != null DO 

grandFather = currentLeaf.Father; 
oldBucket = currentLeaf.Bucket; 
i = Math.Ceiling(oldBucket.Size/2); 

Bl = oldBucket.RemoveNodes(i); 
newFather = oldBucket.RemoveFirstNode(); 
rightLeaf = oldBucket.RemoveFirstNode(); 

B2 = oldBucket.RemoveRemainingNodesO; 
grandFather.RepIaceChiId(currentLeaf,newFather); 
newFather.SetChildrenCcurrentLeaf, rightLeaf); 
UpdatelnOrderAdjacencyLinksO; 
currentLeaf = currentLeaf.RightAdjacents[0]; 

END WHILE 
UpdateAIILinks0; 

Root.ComputeSizelnSubtree 0; 

Algorithm 7. PerformContraction() 

REQUIRE BinaryNode Root 

currentNode = Root.LeftMostLeaf.Father; 

WHILE currentNode != null DO 

currentLeaf = currentNode.LeftChild; 

newFather = currentNode.Father; 

newBucket = currentLeaf.Bucket; 

newBucket.Add(currentLeaf.Father); 

newBucket.Add(currentNode.RightChild); 

newBucket.Add(currentNode.RightChild.Bucket); 

newFather.RepIaceChiId(currentNode, currentLeaf); 

UpdatelnOrderAdjacencyLinksO; 

currentNode = currentNode.RightAdjacents[0]; 


5 


the leftmost leaf is not affected 
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END WHILE 

UpdateAllLinks0; 

Root.ComputeSizelnSubtree 0; 

The contraction (Alg. is carried out as follows: the last level of the PBT 
is deleted and every pair of adjacent buckets is merged into one bucket. In 
particular, each 2-level subtree T 2 at height h = 1 and its buckets Bi and B 2 
are replaced by a leaf u and a bucket B, preserving, at the same time, the in- 
order adjacency. The leaf u is basically the left leaf of T 2 and the new bucket 
B contains the nodes of Bi, the root of T 2 , the right leaf of T 2 and finally the 
nodes of bucket B 2 . During the process, all in-order adjacency links and father- 
child links are updated. After the contraction has been carried out, the leaves 
reconstruct their routing tables and all binary nodes update their rightmost leaf. 

It is obvious that these two operations are quite costly, since they involve a 
reconstruction of the overlay, but this reconstruction rarely happens. 


y/ y/ y// 'y, 'y, // 






■SiS 


Fig. 1: The initial D^-Tree structure (middle) and the operations of extension 
(left) and contraction (right). 


4.4 Single and Range Queries 

The search for an element a may be initiated from any node v at level 1. If v 
is a bucket node, then if its range contains a the search terminates, otherwise 
the search is forwarded to the bucket representative, which is a binary node. In 
case n is a node of the PBT (Alg. [^, let 2; be the node with range of values 
containing a, a G [xz,x'^] and assume w.l.o.g. that x'^ < a. The case where 
Xy > a\s completely symmetric. First, we perform a horizontal binary search at 
the level I of v using the routing tables, searching for a node u with right sibling 
w (if there is such sibling) such that x'^ < a and Xw > a, unless a is in the range 
of u or in the range of any visited node of I and the search terminates. 

More specifically, we keep travelling to the right using the rightmost links of 
the routing tables of nodes in level I (the most distant ones), until we find a node 
q such that Xq > a, or until we have reached the rightmost node qr of level 1. If 
the first case is true, then a is somewhere between q and the last visited node in 
the left of <7, so we start travelling to the left decreasing our travelling step by 
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1. We continue travelling left and right, gradually decreasing the travelling step, 
until we find the siblings u and w mentioned above in step = 0. If the second 
case is true, then < a and according to the in-order traversal the search is 
confined to the right subtree of qr. 

Algorithm 8. Search(integer element) 

REQUIRE BinaryNode currentNode, step != 0 
BinaryNode leftMostNode, rightMostNode; 

/* horizontal search */ 

WHILE step != 0 DO 

IF currentNode.Contains(element) THEN 
RETURN currentNode; 

END IF 

IF currentNode.UpperValue <= element THEN 

IF currentNode == RightMostNodeInLevel THEN 
RETURN currentNode.SearchSubtree(element); 

END IF 

leftMostNode = currentNode; 

IF rightMostNode != null THEN 
step—; 

currentNode = currentNode.RightAdjacents[step]; 

ELSE 

currentNode = currentNode.RightAdjacents.Last; 
step = currentNode.RightAdjacents.Count; 

END IF 

ELSE IF currentNode.LowerValue > element THEN 
/*... similar to previous case ...*/ 

END IF 
END WHILE 

/* step == 0, vertical search */ 

BinaryNode leftSibling, rightSibling; 

IF currentNode.UpperValue <= element THEN 
leftsibling = currentNode; 

rightsibling = currentNode.RightAdjacents[0]; 

ELSE IF currentNode.LowerValue > element THEN 
leftsibling = currentNode.LeftAdjacents[0]; 
rightsibling = currentNode; 

END IF 

targetNode = leftSibling.SearchSubtree(element); 

IF targetNode == null THEN 

targetNode = leftSibling.SearchAncestor(element); 

IF targetNode == null THEN 

targetNode = rightSibling.SearchSubtree(element); 

END IF 
END IF 

RETURN targetNode; 
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Having located nodes u and w, the horizontal search procedure is terminated 
and the vertical search is initiated. Node z will either be the common ancestor 
of u and w (Alg. 10), or in the right subtree rooted at u (Alg. [^, or in the left 
subtree rooted at w. If u is a binary node, it contacts the rightmost leaf y of its 
subtree, li Xy > a then an ordinary top down search from node u will suffice to 
find z. Otherwise, node z is in the bucket of y, or in its right in-order adjacent 
(this is also the common ancestor of u and w), or in the subtree of w. If z belongs 
to the subtree of w, a symmetric search is performed. 

When z is located, if a is found in z then the search was successful, otherwise 
a is not stored in the structure. The search for an element a is carried out in 
0{\ogN) steps and is verified through experiments presented in this work. 

A range query [a, b] initiated at node v, invokes a search operation for element 
a. Node z that contains a returns to v all elements in its range. If all elements 
of u are reported then the range query is forwarded to the right adjacent node 
(in-order traversal) and continues until an element larger than b is reached for 
the first time. 


Algorithm 9. SearchSubtree(integer element) 

REQUIRE BinaryNode currentNode 
IF currentNode.Contains(element) THEN 
RETURN currentNode; 

END IF 

IF currentNode IS Leaf THEN 

RETURN currentNode.Bucket.Search(element); 

END IF 

IF currentNode.UpperValue <= element THEN 
rightLeaf = currentNode.RightMostLeaf; 

IF rightLeaf.Contains(element) DO 
RETURN rightLeaf; 

END IF 

IF rightLeaf.UpperValue <= element THEN 
RETURN rightLeaf.Bucket.Search(element); 

ELSE 

/* search internal subtree */ 
currentNode = currentNode.RightChild; 

WHILE currentNode != null DO 

IF currentNode.Contains(element) THEN 
RETURN currentNode; 

END IF 

IF currentNode.Node.UpperValue <= element THEN 
currentNode = currentNode.RightChild; 

ELSE IF currentNode.Node.LowerValue > element THEN 
currentNode = currentNode.LeftChild; 

END IF 
END WHILE 

RETURN currentNode.Bucket.Search(element); 
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END IF 

ELSE IF currentNode.LowerValue > element THEN 
/*... similar to previous case ...*/ 

END IF 
RETURN null; 

Algorithm 10. SearchAncestor(integer element) 

REQUIRE BinaryNode currentNode 
IF currentNode.UpperValue <= element THEN 
IF currentNode IS Leaf THEN 

ancestor = currentNode.RightlnOrderAdjacent; 

ELSE 

ancestor = currentNode.RightMostLeaf.RightInOrdetAdjacent; 
END IF 

ELSE IF currentNode.LowerValue > element THEN 
/*... similar to previous case ...*/ 

END IF 

IF ancestor.Contains(element) THEN 
RETURN ancestor; 

END IF 
RETURN null; 


4.5 Element Insertions and Deletions 

Handling element npdates. Assume that an update operation (insertion/ 
deletion) is initiated at node v involving element a. By invoking a search opera¬ 
tion, node u with range containing element a is located and the update operation 
is performed on u. Element a is inserted in u or is deleted from it, depending on 
the request. 

In order to apply the weight-based mechanism for load balancing, the element 
should be inserted in a bucket node (similar to node joins) or in a leaf. However, 
node It can be any node in the structure, even an internal node of PBT. In case 
u is a bucket node or a leaf, a is inserted to it and no further action is necessary. 
If It is an internal node of the PBT, element a is inserted in u and then the 
first element of u (note that elements into nodes are sorted) is removed from 
It and inserted to q, the last node of the bucket of the left adjacent of it, in 
order to preserve the sequence of elements in the in-order traversal. This way, 
the insertion has been shifted to a bucket node. The case of element deletion is 
similar. 

After an element update in leaf it or in its bucket, the weight-based mech¬ 
anism is activated and updates the weights by ±1 on the path from leaf it to 
the root, as long as Invariants or do not hold. When the first node w is 
accessed for which Invariants and hold, then the nodes in the subtree of its 
child q in the path, have their weights recomputed. Afterwards, the mechanism 
traverses the path from leaf it to the root, in order to find the first node (if such 
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a node exists) for which Invariant is violated and performs a load-balancing in 
its subtree. 


Load Balancing. The load-balancing mechanism guarantees that if there are 
w{v) elements in total in the subtree of v of size |r!| (total number of nodes in 
the subtree of v including v), then after load-balancing each node stores either 
1 elements. The load-balancing cost is 0{\ogN), which is 


w{v 

"R" 


J w(v) 


verified through experiments, presented analytically in this work. 

The load-balancing in the subtree of v is carried out as follows (Alg. 11). 
We assume that v has |f | nodes in its subtree. A bottom-up computation of the 
weights in all nodes of the subtree is performed, in order to determine the weight 

j -I- 1 elements after load- 
v\ — k nodes will contain 


w{v) 
\v\ 

balancing, where k = w(y) mod |r;|. The remaining 


']{v) of V. Then, the first k nodes will contain 


elements. The redistribution starts from the rightmost node w of the 
rightmost bucket b and it is performed in an in-order fashion. 

We assume that w has m extra elements which must be transferred to other 
nodes. Node w has a link to node w’ in which the m extra elements should be 
inserted. In order to locate node w', we take into consideration the following 
cases: (i) if w is a bucket node, w' is its left node, unless w is the first node of 
the bucket and then w' is the bucket representative, (ii) if w is a leaf, then w' 
is the left in-order adjacent of w and (iii) if w is an internal binary node, then 
its left in-order adjacent is a leaf and w' is the last node of its bucket. Having 
located w', the first m extra elements of w are removed from w and are added to 
the end of the element queue of w\ in order to preserve the indexing structure 
of the tree. Then, the ranges of w and w' are updated respectively. 

The case where m elements must be transferred to node w from node w' is 
completely symmetric. In general, the last m elements of w' are removed from w' 
and are inserted in the first m places in the element queue of w. The intriguing 
part is when w' contains less elements than the m elements that w needs. In 
this case, dest travels towards the leftmost node of the subtree, following the 
in-order traversal, until m < where e{ui) is the number of elements 

of the i — th node on the left. Then, elements of node Us are transferred to rts-i, 
elements from rts_i are transferred to Us -2 and so on, until dest goes backwards 
to w' and m elements are moved from w' into w. 


Algorithm 11. LoadBalanceSubtree() 

REQUIRE BinaryNode node 
node.ComputeSizelnSubtree0; 
node.ComputeWeightInSubtreeO; 
newNodeDensity = node.Weight / node.Size; 
leftMostLeaf = node.LeftMostLeaf; 
currentNode = node.RightMostLeaf; 

WHILE currentNode != leftMostLeaf DO 

destinationNode = node.FindDestinationNode(); 
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IF currentNode.Elements > newNodeDensity THEN 

elementsToMove = currentNode.Elements - newNodeDensity; 
SplitElements(currentNode, destinationNode, elementsToMove); 

ELSE IF currentNode.Elements < newNodeDensity THEN 

elementsToMove = newNodeDensity - currentNode.Elements; 
tempDestNode = destinationNode; 

WHILE tempDestNode != leftMostLeaf AND availableElements < 
elementsToMove DO 

availableElements += tempDestNode.Elements; 
tempDestNode = tempDestNode.FindDestinationNodeO; 

END WHILE 

tempElementsToMove = tempDestNode.Elements - 
(totalElementsAvailable - elementsToMove); 
tempSourceNode = tempDestNode.FindSourceNode(); 

WHILE tempDestNode != currentNode DO 

SplitElements(tempDestNode, tempSourceNode, tempElementsToMove); 
tempDestNode = tempSourceNode; 
tempElementsToMove = tempDestNode.Elements; 
tempSourceNode = tempSourceNode.FindSourceNode(); 

END WHILE 

END IF 

currentNode = destinationNode; 

END WHILE 

4.6 Fault Tolerance 

Searches and updates in the D^-Tree do not tend to favour any node, and in 
particular nodes near the root, which are therefore not crucial and their failure 
will not cause more problems than the failure of any node. However, a single 
node can be easily disconnected from the overlay, when all nodes with which 
it is connected fail. This means that 4 failures (two adjacent nodes and two 
children) are enough to disconnect the root. The most easily disconnected nodes 
are those which are near the root, since their routing tables are small in size. 

When a node w discovers that v is unreachable, the network initiates a node 
withdrawal procedure by reconstructing the routing tables of u, in order v to be 
removed smoothly, as if v was departing. If v belongs to a bucket, it is removed 
from the structure and the links of its adjacent nodes are updated. 

In case v is an internal binary node, its right adjacent node u is first located, 
making use of Lemmrj^ in order to replace v. More specifically, we assume that 
node w discovered that v is unreachable during some operation. Taking into 
account all possible relative positions between w and v, we have the following 
cases, in which we want to locate the right child q oi v that will lead as to u. 

First, if w and v are on the same level, by Lemmej^i) we locate q and thus the 
right adjacent node u of u is the leftmost leaf in the subtree of q. Being more 
clear, if w is connected to v by the f-th link of its routing table, then its right 
child is connected to q by the (i + l)-th link of its routing table. Second, if w is 
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the father of v, by LemmE[^i) its left (right) child p has a link to the missing 
node V and the right child of p has a link to g, so u is located. Third, if w is the 
left (right) child of v, then u is easily located. 

In case n is a leaf, then it should be replaced by the first node u in its bucket. 
However, in D^-Tree predecessor, if a leaf was found unreachable, contacting 
its bucket would be infeasible, since the only link between v and its bucket 
would have been lost. This weakness was eliminated in D^-Tree, by maintaining 
multiple links towards each bucket, distributed in exponential steps (in the same 
way as the horizontal adjacency links). This way, when w is unable to contact 
ri, it contacts directly the first node of its bucket u and u replaces v. 

In any case, the elements stored in v are lost. Moreover, the navigation data 
of u (left adjacent of v) are copied to the first node z in its bucket which takes 
its place, and u has its routing tables recomputed. 


4.7 Single Queries with Node Failures 

The problem of searching an element in a network where a number of nodes 
have fallen, introduces some very intriguing aspects and it can be considered 
as two-dimensional, since the search must be both successful and cost effective. 
A successful search for element a refers to locating the target node z for which 
a e [xzj x’^\. An unsuccessful search refers to the cases where (i) z is unreachable 
and, (ii) there is a path to z but the search algorithm couldn’t follow it to locate 
z, due to failures of intermediate nodes. The D^-Tree predecessor doesn’t provide 
a search algorithm in case of node failures, since it doesn’t sufficiently confront 
the structure’s fault tolerance. In the following, we present the key features of 
our search algorithm, mostly through examples, due to the complexity of its 
implementation. 

The search procedure is similar to the simple search described in section 
14.41 One difference in horizontal search lies in the fact that if the most distant 
right adjacent of v is unreachable, v keeps contacting its right adjacent nodes 
by decreasing the step by I, until it finds node q which is reachable. If all right 
adjacents are unreachable, v contacts its left adjacents, afterwards it tries to 
contact its children, its father and as a last chance, when all other nodes have 
failed, it contacts its left/right in-order adjacents and its left/rightmost leaf. 
Contacting children, in-order adjacents and leaves means a change in the search 
level. 

In case x'^ < a the search continues to the right using the most distant right 
adjacent of q, otherwise the search continues to the left and q contacts its most 
distant left adjacent p which is in the right of v. If p is unreachable, q doesn’t 
decrease the travelling step by 1, but contacts directly its nearest left adjacent 
(at step = 0) and asks it to search to the left. This improvement reduces the 
number of messages that are meant to fail, because of the exponential positions of 
nodes in routing tables and the nature of binary horizontal search. For example, 
in Fig. search starts from vg and vg contacts vy since V 4 has fallen. No node 
contacts V 4 from then on and the number of messages is reduced by 2. 
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Fig. 2: Example of binary horizontal search with node failures 


A vertical search to locate z is always initiated between two siblings u and 
w, which are either both active, or one of them is unreachable, as shown in 
Fig where the left sibling u is active and w, the right one, is unreachable. In 
both cases, first we search into the subtree of the active sibling, then we contact 
the common ancestor and then, if the other sibling is unreachable, the active 
sibling tries to contact its corresponding child (right child for left sibling and 
left child for right sibling). When the child is found the search is forwarded to 
its subtree. We assume (w.l.o.g.) that the left sibling u is active and w, the right 
one, is unreachable, as shown in Fig.[^ First u contacts its rightmost leaf y of its 
subtree. If y is reachable and Xy > a or ii y is unreachable, then an ordinary top 
down search from node u is initiated to find z. If a node in the searching path 
is unreachable, a mechanism, which is described below, is activated to contact 
its children. Otherwise, if Xy < a, node z is in the bucket of y, or in its right 
in-order adjacent (this is also the common ancestor of u and w), or in the left 
subtree of w. 


Root 



[LKCk>--6;; 


Fig. 3: Example of vertical search between u and unreachable w 


In general, when node u wants to contact the left (right) child of unreachable 
node w, the contact is accomplished through the routing table of its own left 
(right) child. If its child is unreachable, (Eig.l^ then u contacts its father Uf and 
Uf contacts the father of w, w/, using LemmiaHi), unless u and w have the same 

































22 E. Sourla, S. Sioutas and K. Tsichlas and C. Zaroliagis 

father. Then Wf, using LemmrQii) twice in succession, contacts its grandchild 
through its left and right adjacents and their grandchildren. 

In case initial node is a bucket node, then if its range contains a the search 
terminates, otherwise the search is forwarded to the bucket representative. If the 
bucket representative has fallen, the bucket contacts its other representatives 
right or left, until it finds a representative that is reachable. The procedure 
continues as described above for the case of a binary node. 


5 The ART+ structure 

In this section we briefly describe and present the theoretical background of 
ART+. ART+ is similar to its predecessor, ART[TT] regarding the structure’s 
outer level. Their difference, which introduces performance enhancements, lies 
in the fact that each cluster-peer of ART+ is structured as a D^-Tree[T]. 


Building the ART+ structure. The backbone structure of ART+ is simi¬ 
lar to lrtQ in which some interventions have been made to improve its per¬ 
formance and increase the robustness of the whole system. ART’*' is built by 
grouping cluster-peers having the same ancestor and organizing them in a tree 
structure recursively. A cluster-peer is defined as a bucket of ordered peers. The 
innermost level of nesting (recursion) will be characterized by having a tree in 
which no more than b cluster-peers share the same direct ancestor, where b is 
a double-exponentially power of two (e.g. 2, 4, 16,...). Thus, multiple indepen¬ 
dent trees are imposed on the collection of cluster-peers. The height of ART+ is 
0(loglogj N) in the worst case. The ART+ structure remains unchanged w.h.p. 
Figure [^illustrates a simple example, where 6 = 2. 

The degree of the cluster-peers at level f > 0 is d{i) = t{i), where t{i) indicates 
the number of cluster-peers at level i. It holds that d(0) = 6 and t(0) = 1. At 
initialization step, the 1st peer, the (Inn + 1) — th peer, the (2 ■ \nn + 1 ) — th 
peer and so on are chosen as bucket representatives, according to the balls in 
bins combinatorial game presented in [6]. Let n be w-bit keys, N be the total 
number of peers and N' be the total number of cluster-peers. Each peer with 
label i (where 1 < f < N) of a random cluster, stores ordered keys that belong 
in the range [(f — l)lnn,ilnn — 1], where N = n/lnn. Each cluster-peer with 
label i' (where 1 < i' < N') stores ordered peers with sorted keys belonging in 
the range \{i' — 1) In^ n,..., i' In^ n — 1], where N' = nj In^ n ot N' = N/ Inn is 
the number of cluster-peers. 

ART+ stores cluster-peers only, each of which is structured as an independent 
decentralized architecture, which changes dynamically after node join/leave and 
element insert/delete operations inside it. In contract to its predecessor, ART, 
whose inner level was structured as a BATON*, each cluster-peer of ART+ is 
structured as a D^-Tree. Each cluster-peer is equipped with a routing table 


LRT: Level Range Tree 
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Fig. 4: The ART+ structure for b = 2. 


named Random Spine Index (RSI), which stores pointers to cluster-peers be¬ 
longing to a random spine of the tree (instead of the LSQ of LRT which stores 
pointers to the peers of the left-most spine). Moreover, instead of using fat 
tables, which store pointers to the collections of peers presented at the same 
level, the appropriate collection of cluster-peers is accessed by using a 2-level 
LRT structure. In ART'*', the overlay of cluster-peers remains unaffected in the 
expected case w.h.p. when peers join or leave the network. 


Load Balancing. The operation of join/leave of peers inside a cluster-peer is 
modelled as the combinatorial game of bins and balls presented in [^ . In this way, 
for an ^(•) random sequence of join/leave peer operations, the load of each cluster 
peer never exceeds 6>(log N) size and never becomes zero in expected w.h.p. case. 
In skew sequences, though, the load of each cluster-peer may become 0{N) in 
worst case. The load-balancing mechanism for a D^-tree structure, as described 
previously, has an amortized cost of O(logiF), where K is the total number of 
nodes in the D^-tree. Thus, in an ART+ structure, the cost of load-balancing is 
0(loglog7V) amortized. 


Routing Overhead. The 2-level LRT is an LRT structure over log^'^ Z buckets 
each of which organizes Z/ log^'^ Z collections in a LRT manner, where Z is the 
number of collections at current level and c is a big positive constant. As a 
consequence, the routing information overhead becomes / \og‘^ N) in the 

worst case. 

^ LSI: Left Spine Index 
® Cl: Collection Index 
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Search Algorithms. Since the structure’s maximum number of nesting levels 
is 0(log^log A) and at each nesting level i we have to apply the standard LRT 
structure in collections, the whole searching process requires 0{\ogl log A) 
hops. Then, we have to locate the target peer by searching the respective decen¬ 
tralized structure. Through the polylogarithmic load of each cluster peer, the 
total query complexity 0(}ogl log A) follows. Exploiting now the order of keys 
on each peer, range queries require 0(logjlogA -|- |A|) hops, where |A| is the 
answer size. 


Join/Leave Operations. A peer u can make a join/leave request to a peer 
V, which is located at cluster peer W. Since the size of W is bounded by a 
polylogN size in expected w.h.p. case, the peer join/leave can be carried out 
in 0(loglog A) hops. The outer structure of ART+ remains unchanged w.h.p. 
as mentioned before, but each D^-tree structure changes dynamically after peer 
join/leave operations. According to D^-Tree performance evaluation, the peer 
join/leave can be carried out in 0(loglog A) hops. 


Node Failures and Network Restructuring. In the ART+ structure, sim¬ 
ilarly to ART, the overlay of cluster-peers remains unchanged in the expected 
case w.h.p., so in each cluster-peer the algorithms for node failure and network 
restructuring are according to the decentralized architecture used. D^-Tree is a 
highly fault-tolerant structure, since it supports procedures for node withdrawal 
and handles massive node failures efficiently. 

6 Experimental Study 

We built a simulatoi]^ with a user friendly interface and a graphical represen¬ 
tation of the structure, to evaluate the performance of D^-Tree. In Fig. we 
present the user interface of the D^-Tree simulator. At the top left, the user 
can construct a new tree structure after setting the tree parameters. At the 
top right, the user can conduct experiments regarding node joins/departures, 
element insertions/deletions, single and range queries, setting the parameters 
for each experiment. In the centre of the screen, useful information is displayed 
regarding the tree construction and experiments. At the bottom, a graphical 
representation of the structure is displayed. 

6.1 Performance Evaluation for D®-Tree 

To evaluate the cost of operations, we ran experiments with different number 
of nodes A from 1,000 to 10,000, in order to be directly compared to BATON, 
BATON* and P-Ring. BATON* is a state-of-the-art decentralized architecture 

® Our simulator is a standalone desktop application, developed in Visual Studio 2010, 
available in https://github.com/sourlaef/d3-tree-sim 
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Fig. 5: The D^-Tree simulator user interface. 


and P-Ring outperforms DHT-based structures in range queries and achieves a 
slightly better load-balancing performance compared to Baton*. For a structure 
of N nodes, 1000 x N elements where inserted. We used the number of passing 
messages to measure the performance of the system. 


Cost of Node Joins/Departures. To measure the network performance for 
node updates, we conducted experiments for three different value ranges of node 
criticality: [0.25 0.75], [0.35 0.65] [0.45 0.55]. For a network of N initial nodes, we 
performed 2N node updates. Figure!^ shows the average amortized redistribu¬ 
tion cost, while [6b| depicts the same metrics in worst case, where the same node 
(the leftmost leaf) is the host node for all node joins. Note that we have taken 
into account only the amortized cost of node joins causing redistributions, since 
otherwise, the amortized cost is negligible. Figure 6c shows the redistribution 
rate for D^-Tree in average and worst case. 

Through experiments, we observed that even in the worst case scenario, the 
D^-Tree node update and redistribution mechanism achieves a better amortized 
redistribution cost, compared to that of BATON, BATON* and P-Ring. We 
also observed that in the average case, during joins and departures of nodes, 
the Invariant 1^ is rarely violated to invoke a redistribution operation (Fig. [^. 
This also depends on the range in which node criticality belongs. When the 
range is narrowed, more redistributions take place during node updates, but the 
amortized cost is low, since the majority of redistributions occur in subtrees of 
low height, as shown in Fig. 6d This is more obvious in worst case. There, when 
we use wide ranges, more node joins take place before a redistribution occurs. 
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making the redistribution operation more costly, since a great number of nodes 
have been cumulated into the bucket of the leftmost leaf. 

Figure |6d] shows in detail the allocation of redistribution height for different 
node criticality ranges in a network of 10,000 initial nodes. We observe that 
in worst case the number of redistributions is more than twice the number of 
redistributions of the average case. 





(c) redistribution rate of winner D®-Tree 



(d) redistr/ion height of winner D®-Tree 


Fig. 6: Node Update operations 


Cost of Element Insertions/Deletions. To measure the network perfor¬ 
mance for the operation of element updates, we conducted experiments for three 
different value ranges of criticality. [0.90 1.10], [0.67 1.50], [0.53 1.90] formed 
from values of constant c = {1.1,1.5,1.9} correspondingly. For a network of N 
nodes and n elements, we performed n element updates. Figure [^ shows the 
average amortized load-balancing cost, while Fig. |7b| shows the load-balancing 
rate. Both average cases and worst cases are depicted in the same graph. The 
average cases for c values of 1.5 and 1.9 led to negligible amortized cost so they 
were disregarded. In worst case, the same node (the leftmost leaf) is the host 
node for all element insertions. Note that we have taken into account only the 
amortized cost of element insertions causing load-balancing operations, since 
otherwise, the amortized cost is negligible. 
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Conducting experiments, we observed that in the average case, the D^-Tree 
outperforms BATON, BATON* and P-Ring. However, in D^-Tree’s worst case, 
the load-balancing performance is degraded compared to BATON* of fanout = 
10 and P-Ring. Moreover, we observed that in the average case, during element 
updates, the Invariant is rarely violated to invoke the load-balancing mech¬ 
anism (Fig. 7bI. This also depends on the value range of criticality. When the 
range is narrowed, more load-balancing operations take place during element 
updates, but the amortized cost is low since the subtree isn’t very imbalanced, 
although the majority of load-balancing operations occur in subtrees of high 
height, as shown in Figure On the other hand, when we use wide ranges, 
more element updates take place before the load-balancing mechanism is acti¬ 
vated, leading to more frequent and costly operations of load-balancing. 



(c) load-balancing height of winner D^-Tree 
Fig. 7: Element Update operations 


Cost of Element Search with/without Node Failures. To measure the 
network performance for the operation of single queries, we conducted experi¬ 
ments in which for each TV, we performed 2M [M is the number of binary nodes) 
searches. The search cost is depicted in Fig. An interesting observation here 
was that although the cost of search in D^-Tree doesn’t exceed 2 • log N, it is 
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higher that the cost of BATON, BATON* and P-Ring. This is due to the fact 
that when the target node is a Bucket node, the search algorithm, after locating 
the correct leaf, performs a serial search into its bucket to locate it. 

To measure the network performance for the operation of element search 
with node failures, we conducted experiments for different percentages of node 
failures: 10%, 20%, 30%, 50% and 75%. For each N, we performed 2M {M is the 
number of binary nodes) searches divided in 4 sets. A different set of nodes failed 
in each of the 4 sets. Figure |Fb| depicts the increase in search cost when massive 
failures of nodes take place in D^-Tree, BATON, different fanouts of BATON* 
and P-Ring. We observe that D^-Tree maintains low search cost, compared to 
the other structures, even for a failure percentage > 30%. 



(a) average messages without failures 





NumberOI Nodes 

—7S« 


(d) success percentage of D®-Tree 


Fig. 8: Single Queries without/with node failures 


Details about the behaviour of the enhanced search mechanism of D^-Tree in 
case of node failures, are depicted in Fig. ^ and Fig. which show the average 
number of messages required and the success percentage respectively. Experi¬ 
menting, we observed that when the node failure percentage is small (10% to 
15%), the majority of single queries that fail are the ones whose elements belong 
to failed nodes. When the number of failed nodes increases, single queries are not 
always successful, since the search mechanism fails to find a path to the target 
node although the node is reachable. However, even for the significant node fail- 
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ure percentage of 30%, our search algorithm is 85% successful, confirming thus 
our claim that the proposed structure is highly fault-tolerant. 


6.2 Performance Evaluation for ART+ 

We also evaluated the performance of ART+ structure and compared it to its 
predecessor, ART [TT]. Each cluster peer of the ART and ART+ is a BATON* 
and D^-Tree structure respectively. BATON* was implemented and evaluated 
in [4], while ART was evaluated in m, using the Distributed Java D-P2P-Sim 
simulator presented in uni- The source code of the whole evaluation process, 
which showcases the improved performance, scalability, and robustness of ART 
over BATON* is publicly availabl^^ For the performance evaluation of ART+, 
we used the D^-Tree simulator. 

To evaluate the performance of ART and ART+ for the search and load¬ 
balancing operations, we ran experiments with different number of total nodes 
N from 50,000 to 500,000. As proved in m , each cluster peer stores no more than 
0.75 log^ A peers in smooth distributions (normal, beta, uniform) and no more 
than 2.51og^A peers in non-smooth distributions (powlow, zipfian, weibull). 
Moreover, we inserted elements equal to the network size multiplied by 2000, 
which are numbers from the universe [1...1,000, 000, 000]. We used the number 
of passing messages to measure the performance. 

Note here that, as proved in ART outperforms BATON* in search 
operations, except for the case where 6 = 2. Moreover, ART achieves better 
load-balancing compared to BATON*, since the cluster-peer overlay remains 
unaffected w.h.p. through joins/departures of peers and the load-balancing per¬ 
formance is restricted inside a cluster-peer. Consequently, in this work, ART+ 
is compared directly to ART. 


Cost of Search Operations. To measure the network performance for the 
search operations (single and range queries), we conducted experiments for dif¬ 
ferent values of 6, 2, 4 and 16, in which for each N, we executed 1,000 single 
queries and 1,000 range queries. The search cost is depicted in Fig. Both nor¬ 
mal (beta, uniform) and worst cases (powlow, zipfian, weibull) are depicted in 
the same graph. Experiments confirm that the query performance of ART and 
ART+ is O(log^logA) and the slight performance divergences are due to the 
fact that BATON*, as the inner structure of ART’s cluster-peer, performs better 
that D^-Tree in search operations. 

In case of massive failures, the search algorithm has to find alternative paths 
to overcome the unreachable peers. Thus, an increase in node failures results in 
an increase in search costs. To evaluate the system in case of massive failures, 
we initialized the system with 10,000 peers and let them randomly fail without 
recovering. At each step, we check if the network is partitioned or not. Since 
the backbone of ART and ART+ remains unaffected w.h.p., the search cost 


10 


http://code.google.com/p/d-p2p-sim/ 
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is restricted inside a cluster-peer (BATON* or D^-Tree respectively), meaning 
that b parameter does not affect the overall expected cost. Figure |M| illustrates 
the effect of massive failures. We observe that both structures are fault tolerant 
since the failure percentage has to reach the threshold of 60% to partition them. 
Moreover, even in the worst case scenario, the ART+ maintains lower search 
cost compared to ART, since D^-Tree handles node failures more effectively 
than BATON*. 






Fig. 9: Cost of search operations 


Cost of Load-Balancing Operations. To evaluate the cost of load-balancing, 
we tested the network with a variety of distributions. For a network of N total 
nodes, we performed 2N node joins. Both ART and ART+ remain unaffected 
w.h.p., when peers join or leave the network, thus the load-balancing perfor¬ 
mance is restricted inside a cluster-peer (BATON* or D^-Tree respectively), 
meaning that b parameter does not affect the overall expected cost. The load¬ 
balancing cost is depicted in Fig. 10 Both normal and worst cases are depicted 
in the same graph. 
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Experiments confirm that ART+ has an O(loglogiV) load-balancing perfor¬ 
mance, instead of the ART performance of 0{m ■ log^logiV). Thus, even in 
the worst case scenario, the ART+ outperforms ART, since D^-Tree has a more 
efficient load-balancing mechanism than BATON* (Fig. 10). 
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Fig. 10: Cost of load-balancing operation 


7 Conclusions 

In this work, we presented a dynamic distributed deterministic structure, called 
D^-Tree. Our proposed structure introduces many enhancements over the solu¬ 
tions in literature and its predecessor, regarding load-balancing and fault toler¬ 
ance. We presented in brief the theoretical algorithmic analysis, in which D^-Tree 
is based on, and we described thoroughly the key aspects of the implementation. 
Verifying the theory, we have proved through experiments that D^-Tree outper¬ 
forms other well-known tree-based structures, by achieving an O(logiV) amor¬ 
tized bound in the most costly operation of load-balancing, even in a worst case 
scenario. Moreover, investigating the structure’s fault tolerance, both theoreti¬ 
cally and through experiments, we proved that D^-Tree is highly fault tolerant, 
since, even for massive node failures of 30%, it achieves a significant success rate 
of 85% in element search, without increasing the cost considerably. 

Afterwards we went one step further, in order to achieve sub-logarithmic 
complexity and proposed the ART"*" structure, exploiting the excellent per¬ 
formance of D^-Tree. We proved that the communication cost of query op¬ 
erations, element update and node join/leave operations of ART+ scale sub- 
logarithmically expected w.h.p. Moreover, the cost for the load-balancing oper¬ 
ation is sub-logarithmic amortized. Experimental comparison to its predecessor, 
ART, showed slightly less efficiency towards search operations (single and range 
queries), but improved performance for the load-balancing operation and the 
search operations in case of node failures. Moreover, experiments confirmed that 
ART+ is highly fault tolerant in case of massive failures. Note that, so far, ART 
outperforms the state-of-the-art decentralized structures. 
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